In this tutorial, we will apply CrowdTruth metrics to a sparse multiple-choice crowdsourcing task for Event Extraction from sentences. Workers were asked to read a sentence and then pick from a multiple-choice list the words or word phrases in the sentence that are events or actions. The options available in the multiple-choice list change with the input sentence. The task was executed on FigureEight. For more crowdsourcing annotation task examples, click here.
We will also show how to translate an open task into a closed task by processing both the input units and the annotations of a crowdsourcing task, and how this impacts the results of the CrowdTruth quality metrics. We start with the open-ended extraction task described above.
To replicate this experiment, the code used to design and implement this crowdsourcing annotation template is available here: template, css, javascript.
This is a screenshot of the task as it appeared to workers:
A sample dataset for this task is available in this file, containing raw output from the crowd on FigureEight. Download the file and place it in a folder named data that sits next to the folder containing this notebook (the code below reads it from ../data/). Now you can check your data:
In [1]:
import pandas as pd
test_data = pd.read_csv("../data/event-text-sparse-multiple-choice.csv")
test_data.head()
Out[1]:
In [2]:
import crowdtruth
from crowdtruth.configuration import DefaultConfig
Our test class inherits the default configuration DefaultConfig, while also declaring some additional attributes that are specific to the Event Extraction task:

- inputColumns: list of input columns from the .csv file with the input data
- outputColumns: list of output columns from the .csv file with the answers from the workers
- annotation_separator: string that separates the crowd annotations in outputColumns
- open_ended_task: boolean variable defining whether the task is open-ended (i.e. the possible crowd annotations are not known beforehand, like in the case of free text input); because the list of options changes with each input sentence, we first treat the task as open-ended and set this variable to True
- annotation_vector: list of possible crowd answers, mandatory to declare when open_ended_task is False; it is not needed here, but we will use it later, when we translate the task into a closed one, as the list of all events that were given as input to the crowd in at least one sentence
- processJudgments: method that defines processing of the raw crowd data; for this task, we process the crowd answers to correspond to the values in annotation_vector
The complete configuration class is declared below:
In [3]:
class TestConfig(DefaultConfig):
    inputColumns = ["doc_id", "events", "events_count", "original_sentence", "processed_sentence", "sentence_id", "tokens"]
    outputColumns = ["selected_events"]
    annotation_separator = ","

    # processing of an open task
    open_ended_task = True

    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
            # remove square brackets from annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('[',''))
            judgments[col] = judgments[col].apply(lambda x: str(x).replace(']',''))
            # remove the quotes around the annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('"',''))
        return judgments
In [4]:
data_open, config = crowdtruth.load(
    file = "../data/event-text-sparse-multiple-choice.csv",
    config = TestConfig()
)
data_open['judgments'].head()
Out[4]:
In [6]:
results_open = crowdtruth.run(data_open, config)
results_open is a dict object that contains the quality metrics for sentences, events and crowd workers.
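For instance, you can quickly inspect which result tables are available (the exact set of keys depends on the crowdtruth-core version, so treat this as a sanity check rather than a specification):
In [ ]:
# list the available result tables; typically "units", "workers",
# "annotations" and "judgments" are present
print(results_open.keys())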
The sentence metrics are stored in results_open["units"]:
In [7]:
results_open["units"].head()
Out[7]:
The uqs column in results_open["units"] contains the sentence quality scores, capturing the overall worker agreement over each sentence. Here we plot its histogram:
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(results_open["units"]["uqs"])
plt.xlabel("Sentence Quality Score")
plt.ylabel("Sentences")
Out[8]:
The unit_annotation_score column in results_open["units"] contains the sentence-event scores, capturing the likelihood that an event is expressed in a sentence. For each sentence, we store a dictionary mapping each event to its sentence-event score.
In [9]:
results_open["units"]["unit_annotation_score"].head(10)
Out[9]:
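As an illustration of how these scores can be used, the sketch below picks the highest-scoring event for each sentence; it assumes each entry of unit_annotation_score behaves like a dictionary mapping an annotation to its score:
In [ ]:
# a sketch: the most likely event (or "no_event") for every sentence
best_event = results_open["units"]["unit_annotation_score"].apply(
    lambda scores: max(scores, key=scores.get)
)
best_event.head()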
The worker metrics are stored in results_open["workers"]:
In [10]:
results_open["workers"].head()
Out[10]:
The wqs column in results_open["workers"] contains the worker quality scores, capturing the overall agreement between each worker and all the other workers.
In [27]:
plt.hist(results_open["workers"]["wqs"])
plt.xlabel("Worker Quality Score")
plt.ylabel("Workers")
Out[27]:
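A common use of the worker quality scores is to filter out low-quality workers. The sketch below flags workers under an illustrative threshold of 0.1; the threshold is an assumption for this example, not a value prescribed by CrowdTruth:
In [ ]:
# workers whose quality score falls below an illustrative threshold
low_quality_workers = results_open["workers"][results_open["workers"]["wqs"] < 0.1]
low_quality_workers.head()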
The goal of this crowdsourcing task is to understand how clearly a word or word phrase expresses an event or an action across all the sentences in the dataset, rather than at the level of a single sentence as above. Therefore, in the remainder of this tutorial we show how to translate the open task into a closed task by processing both the input units and the annotations of the crowdsourcing task.
The answers from the crowd are stored in the selected_events
column.
In [28]:
test_data["selected_events"][0:30]
Out[28]:
As you already know, each word can be expressed in a canonical form, i.e., as a lemma. For example, the words run, runs and running all have the lemma run. As you can see in the previous cell, events in text can appear under multiple forms. To evaluate the clarity of each event, we will process both the input units and the crowd annotations to refer to each word in its canonical form, i.e., we will lemmatize them.
Next, we define the function used to lemmatize the options that are shown to the workers in the crowdsourcing task:
In [29]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def nltk2wn_tag(nltk_tag):
    # map Penn Treebank POS tags to WordNet POS tags
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_events(event):
    # tokenize and POS-tag the event phrase (anything after "__" is ignored)
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(str(event.lower().split("__")[0])))
    wn_tagged = map(lambda x: (str(x[0]), nltk2wn_tag(x[1])), nltk_tagged)
    res_words = []
    for word, tag in wn_tagged:
        # fall back to the noun lemma when the POS tag has no WordNet equivalent
        if tag is None:
            res_word = wordnet._morphy(str(word), wordnet.NOUN)
        else:
            res_word = wordnet._morphy(str(word), tag)
        if res_word == []:
            res_words.append(str(word))
        elif len(res_word) == 1:
            res_words.append(str(res_word[0]))
        else:
            res_words.append(str(res_word[1]))
    lematized_keyword = " ".join(res_words)
    return lematized_keyword
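As a quick sanity check (assuming the NLTK resources above downloaded successfully), lemmatize_events should map inflected forms to a canonical form; the exact output can vary slightly with the POS tagger:
In [ ]:
# expected to yield something like "run" and "buy"
print(lemmatize_events("ran"))
print(lemmatize_events("bought"))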
The following functions create the values of the annotation vector and extract the lemmas of the events selected by each worker.
In [30]:
def define_annotation_vector(eventsList):
    # collect the lemmatized form of every event option shown to the crowd
    events = []
    for i in range(len(eventsList)):
        currentEvents = eventsList[i].split("###")
        for j in range(len(currentEvents)):
            if currentEvents[j] != "no_event":
                lematized_keyword = lemmatize_events(currentEvents[j])
                if lematized_keyword not in events:
                    events.append(lematized_keyword)
    events.append("no_event")
    return events

def lemmatize_keywords(keywords, separator):
    # lemmatize each annotation in a separator-delimited string of crowd answers
    keywords_list = keywords.split(separator)
    lematized_keywords = []
    for keyword in keywords_list:
        lematized_keyword = lemmatize_events(keyword)
        lematized_keywords.append(lematized_keyword)
    return separator.join(lematized_keywords)
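To see what these helpers produce, here is a small illustrative call; the input strings are made up for this example, while the actual crowd answers come from the selected_events column:
In [ ]:
# lemmatize a comma-separated list of crowd answers
print(lemmatize_keywords("ran,bought", ","))          # expected: something like "run,buy"
# build an annotation vector from a tiny made-up events column
print(define_annotation_vector(["ran###no_event"]))   # expected: something like ['run', 'no_event']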
In [31]:
class TestConfig(DefaultConfig):
    inputColumns = ["doc_id", "events", "events_count", "original_sentence", "processed_sentence", "sentence_id", "tokens"]
    outputColumns = ["selected_events"]
    annotation_separator = ","

    # processing of a closed task
    open_ended_task = False
    annotation_vector = define_annotation_vector(test_data["events"])

    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
            # remove square brackets from annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace("[",""))
            judgments[col] = judgments[col].apply(lambda x: str(x).replace("]",""))
            # remove the quotes around the annotations
            judgments[col] = judgments[col].apply(lambda x: str(x).replace('"',''))
            # lemmatize the crowd answers so they match the annotation vector
            judgments[col] = judgments[col].apply(lambda x: lemmatize_keywords(str(x), self.annotation_separator))
        return judgments
In [32]:
data_closed, config = crowdtruth.load(
    file = "../data/event-text-sparse-multiple-choice.csv",
    config = TestConfig()
)
data_closed['judgments'].head()
Out[32]:
In [36]:
results_closed = crowdtruth.run(data_closed, config)
The metrics for each event in the annotation vector are stored in results_closed["annotations"]:
In [37]:
results_closed["annotations"]
Out[37]:
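Assuming the annotation quality score column is named aqs, as in other CrowdTruth tutorials, its distribution can be plotted in the same way as the sentence and worker scores above:
In [ ]:
# histogram of the annotation quality scores (column name "aqs" is an assumption)
plt.hist(results_closed["annotations"]["aqs"])
plt.xlabel("Annotation Quality Score")
plt.ylabel("Events")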
In [39]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.scatter(
    results_open["units"]["uqs"],
    results_closed["units"]["uqs"],
)
plt.plot([0, 1], [0, 1], 'red', linewidth=1)
plt.title("Sentence Quality Score")
plt.xlabel("open task")
plt.ylabel("closed task")
Out[39]:
In [41]:
plt.scatter(
    results_open["workers"]["wqs"],
    results_closed["workers"]["wqs"],
)
plt.plot([0, 1], [0, 1], 'red', linewidth=1)
plt.title("Worker Quality Score")
plt.xlabel("open task")
plt.ylabel("closed task")
Out[41]:
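Beyond the visual comparison, a simple correlation gives a rough sense of how strongly the open-task and closed-task scores agree. This sketch assumes the two result sets share the same unit and worker indices, which holds here because both runs were computed from the same input file:
In [ ]:
# correlation between open-task and closed-task quality scores
print(results_open["units"]["uqs"].corr(results_closed["units"]["uqs"]))
print(results_open["workers"]["wqs"].corr(results_closed["workers"]["wqs"]))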